IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus
نویسنده
چکیده
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to assure its quality. We also apply language specific text processing such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: ‘plain’, stored in text format and ‘morphologically enriched’, stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
منابع مشابه
Towards an Indonesian-English SMT System: A Case Study of an Under-Studied and Under-Resourced Language, Indonesian
This paper describes a work on preparing an Indonesian-English Statistical Machine Translation (SMT) System. It includes the creation of Indonesian morphological analyzer, MorphInd, and the composing of an Indonesian-English parallel corpus, IDENTIC. We build an SMT system using the state-of-the-art phrase-based SMT system, MOSES. We show several scenarios where the morphological tool is used t...
متن کاملComparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...
متن کاملIdentifying Indonesian-core vocabulary for teaching English to Indonesian preschool children: a corpus- based research
This corpus-based research focuses on building a corpus of Indonesian children’s storybooks to find the frequent content words in order to identify Indonesian-core vocabulary for teaching English to Indonesian preschool children. The data was gathered from 131 Indonesian children’s storybooks, which resulted in a corpus of 134,320 words. These data were run through a frequency menu in MonoConc ...
متن کاملSupervised Morphology Generation Using Parallel Corpus
Translating from English, a morphologically poor language, into morphologically rich languages such as Persian comes with many challenges. In this paper, we present an approach to rich morphology prediction using a parallel corpus. We focus on the verb conjugation as the most important and problematic phenomenon in the context of morphology in Persian. We define a set of linguistic features usi...
متن کاملBuilding Parallel Corpora for SMT System: A Case Study of English-Manipuri
The Statistical Machine Translation (SMT) systems are developed using sentence aligned parallel corpus. The difficulty is that there is no parallel corpus at the required measure for many language pairs. The preparation of large scale parallel corpus takes time and demands the linguistics skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...
متن کامل